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Abstract 

In the past decades, cryo-electron microscopy (cryo-EM) has become a powerful tool for 
obtaining high resolution three-dimension (3D) structures of biological macro-molecules. 
A cryo-EM data set usually contains at least thousands of 2D projection images of free 
oriented particles. The characteristics of these projections include having low signal-to-noise 
ratio, containing many misaligned images as outliers, and consisting of a large number of 
clusters due to free orientations. Clustering analysis is a necessary step to group the similar 
orientation images for noise reduction. In this article, we propose a clustering algorithm 
7-SUP that we apply a mixture of g-Gaussian family to model the image distributions and 
employ the minimization of a 7-divergence to derive the estimate of the cluster means and 
finally get the solutions through a self-updating process implementation. 7-SUP copes well 
with the cryo-EM images by its advantages as follows, (a) It resolves the sensitivity issue 
of choosing the number of clusters and cluster initials, (b) It sets a hard influence range for 
each component in the mixture model and hence leads to a robust procedure for estimating 
each of the local clusters, (c) It performs a soft rejection by down weighting deviant points 
from cluster centers and further enhances the robustness, (d) In each iteration, it shrinks 
the mixture model parameter estimates toward cluster centers, and improves the efficiency 
of mixture estimation. 

Key words and phrases: clustering algorithm, cryo-EM images, 7-divergence, fe-means, 
multilinear principal component analysis, ^-Gaussian distribution, robust statistics, self- 
updating process. 
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1 Introduction and motivating data example 

Determining 3D macromolecular structures for large biological machineries stands for mile- 
stones to understand the chemistry underlying vital biological processes. These heroic 
efforts have been achieved mostly by X-ray crystallography and they have been often re- 
warded by the Nobel Prize in Chemistry, for example, awarded to Roger Kornberg in 2006 
for delineating RNA polymerase II of yeast, the machine that expresses genetic information, 
and to V. Ramakrishnan, Tom Steitz and Ada Yonath in 2009 for studies of the structure 
and function of the ribosome, the factory for synthesizing poly-peptide chains according to 
the RNA message. However, many large proteins have resisted all attempts to crystalliza- 
tion. Cryo-EM can focus electrons to obtain the image of macromolecules without the need 
of crystals and it has thereby emerged as a powerful alternative to X-ray crystallography 
for structural determination (Henderson 1995; van Heel et al. 2000; Saibil 2000; Frank 2002, 
2009, 2012; Jiang et al. 2008; Liu et al. 2010; Grassucci et al. 2011). 

In cryo-EM, the sample of macromolecules in their native states is rapidly frozen in a 
thin layer of ice as individual particles (Lepault et al. 1983; Adrian et al. 1984; Dubochet 
2012). Due to the low electron dose used for imaging to reduce radiation damage, together 
with the low contrast arising from little density difference between the macromolecule and 
its surrounding ice medium and poor microscope contrast transfer mechanism (Danev and 
Nagayama 2008; Chang et al. 2010; Hall et al. 2011), the raw cryo-EM images exhibit very 
low signal-to- noise ratio (SNR). A large number of huge pixel (100 x 100) projections of 
the same molecule, corresponding to different and unknown orientations are collected to 
compensate the extremely low SNR. 

The single particle reconstruction of electron cryo-microscopy is to obtain the 3D struc- 
ture of a macromolecule given its 2D projection images at unknown random directions. 
3D reconstruction can be performed by angular reconstitution method (van Heel 1987; van 
Heel et al. 1997; Frank 2006). For the "Ab Initio" approach, each common-line between two 
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projections and the spanned angle are prerequisite. Ideally, they can be decided directly 
from raw images (Crowther 1971; Penczek et al. 1996; Thuman-Commike and Chiu 1997, 
2000). Nevertheless, as aforementioned, the microscope projections are so noisy that correct 
answers to common-lines from the raw images are virtually unavailable. By aligning images 
using in-plane rotation and translation, the noisy images coming from the similar viewing 
angles could be grouped together respectively by their similarity, which is represented by 
their proximity in the high-dimension space of image data points. Averaging the subset 
of images would enhance the signal and reduce the noise. So far, performing the search 
of common-lines using the de-noised class averages of different viewing angles is proven 
successful when the classes are homogeneous and the molecule is not too small. Now the 
solution for de-noising has been reduced to a problem at the 2D, namely image alignment 
and clustering. The clustering analysis is not trivial as the poor SNR still persists (van 
Heel and Frank 1980; van Heel 1984, 1989). 

Here, we focus on the clustering step and assume that the image alignment has been 
carried through. In the vast number of clustering algorithms developed, there are two major 
distinct branches based on different concepts. The model-based methods (Banfield and 
Raftery 1993) model the data as Ylk=i n kf{ x 'i ^k, ffc)- Each individual subject x is clustered 
into argmax fc f(x; p,k, <7fc), where fij~ and &k are the estimations of the mean and variance for 
each cluster k. The distance-based methods enforce the idea of 'distance' that measures the 
similarity between two data points, like the Hierarchical clustering (Hartigan, 1975), the 
/c-means algorithm (McQueen 1967; Lloyd 1982) and the SUP clustering algorithm (Chen 
and Shiu, 2007; Shiu and Chen, 2012). One weakness of hierarchical clustering is that it 
is an irrevocable process in which the mistake made at early steps cannot be corrected at 
later stage. The fc-means algorithm, though very popular, requires the number of clusters 
K and the random initials to execute its iterative process which may not escape from the 
curse of local extremes. 
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From the idea of iterative generated matrices (McQuitty 1968) adopted in the General- 
ized Association Plots (Chen 2002), Shiu and Chen (2012) proposed a self-updating process 
(SUP) which starts with each individual data point as a singleton cluster such that neither 
random initials nor cluster number is required. The SUP process, once initiated, iteratively 
merges or parts data according to the weighted averages over the local neighborhoods de- 
fined by a hard influence region, and finally stops as the data points converge and the 
representative ones emerge as cluster centers if the weighted function is proportional to the 
similarity accordingly. Moreover, the SUP allows extremely small-sized clusters consisting 
of only a few data points or singleton to accommodate outliers, which is an important 
character to warranty robustness. However, the choice of the weighted function is a state 
of the art. 

In this paper, we combine the model-based method and SUP to propose a clustering 
algorithm 7-SUP which not only avoids the assignments of the number of clusters and the 
random initials but also parameterizes the weighted functions in SUP. We apply a more 
broad distance concept 7-divergence (Fujisawa and Eguchi, 2008; Cichocki and Amari, 2010; 
Eguchi et al., 2011) to measure the similarity and a wider range of distribution g-Gaussian 
(Amari and Ohara, 2011; Eguchi et al., 2011) which covers the commonly used Gaussian 
and t distributions and a rich family of distributions with localized domain to model the 
data. This framework allows us to construct the estimator for the mean of each cluster 
through the minimization of the 7-divergence and then we apply the self updating process 
in SUP to iteratively obtain the numerical solutions. 

For application to cryo-EM data, we intentionally choose simulated images as test data 
for several reasons. These simulated data were generated with microscope conditions and 
noise mimicking the experimental images. Importantly, unlike the experimental images, the 
viewing angles of the simulated are known precisely, which would allow us to quantify the 
clustering performance of 7-SUP and the conventional fc-means and other state of the art 
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methods in the structural biology community. In real case, the alignment and clustering is 
convoluted, as reliable clustering depends on reliable alignment, which remains a conundrum 
for cryo-EM images due to their very low SNR (Yang et al., 2012). By testing on simulated 
data, we can decouple the alignment and clustering issues. As to the alignment problem, 
we have investigated two scenarios. First, we study the case of perfectly aligned images; 
secondly, we deliberately introduce misaligned images to each class by in-plane rotation to 
test 7-SUP's performance for outliers. Our simulation results show that 7-SUP performs 
very well especially for those cases including misalignment images. 7-SUP outperforms 
other methods in the way that it is able to isolate those misalignment images as singletons. 

The paper is organized as follows. In Section 2, we have a brief review of 7-divergence 
and (/-Gaussian mixture relevant for 7-SUP. In Section 3, we formulate the 7-SUP clustering 
algorithm as a minimum 7-divergence estimation of (/-Gaussian mixture with A;-means as a 
special case. In Section 4, we show 7-SUP's stability to tuning parameter selection and its 
efficiency. In Section 5, we apply 7-SUP to the simulated cryo-EM images. In Section 6, 
we summarize our conclusions. 

2 A brief review of 7-divergence and qr-Gaussian 

In this section we briefly review the concepts of 7-divergence and (/-Gaussian distribution, 
which are the key technical tools for our 7-SUP clustering algorithm. 

2.1 7-divergence 

The most widely used distribution divergence is probably the Kullback-Leibler divergence 
(KL-divergence) due to its connection to maximum likelihood estimation (MLE). The 7- 
divergence is a generalization of KL-divergence indexed by a power parameter 7. Let 
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Definition 1 (Fujisawa and Eguchi 2008; Cichocki and Amari 2010; Eguchi et al. 2011). 

For f,g€ M, define the ^-divergence D 7 (-||-) and ^y-cross entropy C 7 (-||-) as follows: 

1 f gt(x) 



£> 7 (/|| 5 ) = C y (f\\g) - C 7 (f\\f) with C,(f\\g) = - / f^/(x)i, (1) 

7(7+1)7 \\9\\y + l 

where ||g|| 7 +i = {/ <7 7+1 (x)dx} 1 ^' r+1 ^ is a normalizing constant. 

The 7-divergence can be understood as the divergence function associated with a specific 
scoring function, namely the pseudospherical score (Good, 1971; Gneiting and Raftery, 
2007). The pseudospherical score is given by S(f,x) = / 7 (x)/||/||^ +1 . The associated 
divergence function between / and g can be calculated from equation (7) in Gneiting and 
Raftery (2007) to be 



d(f,g):= J S(f,x)f(x)dx- J S(g,x)f(x)dx = 1 ( 1 + 1)-D^f\\g). (2) 

It implies that d(-, •) and _D 7 (-||-) are equivalent. Moreover, can also be expressed as 

a functional Bregman divergence (Frigyik et al., 2008). Note that ||g|| 7+ i is a normalizing 
constant so that the cross entropy enjoys the property of being projective invariant, i.e., 
C 7 (/||c<7) = C 7 (/||<7), V c > (Eguchi et al., 2011). Thus, the 7-divergence is actually 
defined on the part of unit sphere in M. , precisely, on 

7 :={/e^:||/|| 7+ i = l}. 

By Holder's inequality, it can be shown that, for f,g € f2 7 , D ry {f\\g) > and the equality 
holds if and only if g = Xf for some A > (Eguchi et al., 2011). For a given /, minimizing 
Dry(f\\g) over g in a certain function class is equivalent to minimizing the 7-loss function 

L lJ {g) = ~hxU g^ x )x f(x)dx\ + -^ l \nU g^ + \x)dx} . (3) 

Note that in the limiting case, lim 7 _j.o D 1 (f\\g) = Do(f\\g) = f f(x)ln{f(x)/g(x)}dx 
gives the KL-divergence. The MLE which corresponds to the minimization of the KL 
divergence -Dq(-||-) has been shown to be optimal in parameter estimation in the sense of 



having minimum asymptotic variance. This optimality comes with the cost that MLE relies 
heavily on the correctness of model specification. Therefore, MLE or the minimization of the 
KL divergence is not robust against model deviation and outliers. On the other hand, the 
minimum 7-divergence estimate is shown to be super robust (Fujisawa and Eguchi, 2008) 
against data contamination. It is this robustness property makes 7-divergence suitable for 
local learning (Mollah et al. 2010) and, hence, the purpose of clustering. 

2.2 g-Gaussian 

The (/-Gaussian distribution is a generalization of the Gaussian distribution by replacing 
the usual exponential function with the (/-exponential 

1 

exp (u) = {1 + (1 — q)u}]^ q , where {x} + = max{x, 0}. 
Let S p denote the collection of all strictly positive definite p x p symmetric matrices. 

Definition 2 (modified from Amari and Ohara 2011; Eguchi et al. 2011). For a fixed 
q € ( — 00, 1 + |), define the p-variate q-Gaussian distribution G q (pi, X) with parameters 
9 = (fi, S) G W x S p to have the probability density function 

f q (x; 9) = C ^ exp g {u(x; 9)} , x £ W, (4) 

(V ZTTfP y |2j| 



where u(x; 9) = —\{x — //) T S 1 (x — //) and c Pj(? is a constant so that f f q (x; 6)dx = 1. 

The constant c Pi9 is given below (cf. Eguchi et al., 2011): 

f r ( 1+ ^ + B) for -oo <q<1 

r(i+^) ' tOT 00< « <i ' 

1, for q -)• 1, (5) 

( 9 -l)P/2 r( -Xr ) 

r ' forl<(/<l + |. 

The class of the q-Gaussian distributions covers some well-known distributions. When 
lim^i, the (/-Gaussian distribution is reduced to the Gaussian distribution. For 1 < q < 



1 + |, the g-Gaussian distribution is equivalent to the multivariate i-distribution. This can 
be seen by setting v = 2/(q — 1) — p > 0. Then, the g-Gaussian density function in (4) is 
proportional to 

{ p+v 
i + hx-^f(^^y\x-A 2 , (6) 

which is exactly the pdf of a p-variate i-distribution (apart from constant term) with location 
and scale parameters (//, ^jpX) and degrees of freedom v. Depending on the choice of q, 
the support of G q (fi, X) also differs. For 1 + ~ > q > 1 (i.e., for Gaussian distribution and 
t-distribution) , the support of G q (fi,Y<) is the entire MP. For q < 1, however, the support 
of G q (n, X) depends on q in the form of 

|x:(x-^fS- 1 (x-^)< T l-}. (7) 

Thus, by choosing q < 1, it sets a hard influence range and leads to perform data rejection 
in our clustering algorithm. Note that if X ~ G q {fx, S) with q < 1 + ^p^, 1 then E(X) = /i 
and Cov(X) = 2+(p J )(1 _ q) S. 

2.3 Minimum 7-divergence for estimating a q-Gaussian 

The 7-divergence is a discrepancy measure for two functions in Ai. Its minimum can then 
be used as a criterion to approximate an underlying probability density function / from a 
certain model class Mq parameterized by 9 G © C M. m . From (1) and (3), in the population 
level, / is estimated by 

/* = argminl} 7 (/||p) = argminC 7 (/||sO = argminL 7i/ (sr). (8) 
g£M e geM& g&M e 

In this study, we consider Mq to be the family of (/-Gaussian distributions G q (fj,, X) with 

9 = (n, S) introduced in Definition 2. Then, for any given values of 7 and q, the loss 

1 For q < 1 + — 7— , it ensures the existence of the f moment of X. 
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function L 1 j{g) in (8) evaluated at g = f q (x;9) € A4q becomes 



■lb 

7 



7 



In 



7 



expJu^))^ 1 



7 + 1 



In 



c Pi9 exp g (u(x;6>)) 



(7+1) 



fix 



7+1 



/(*) 



/ {exp q (u(v;6))y +1 dv 



dx 



-I In 

7 



dx 



f(x){fxt3.(x;n, — |—X;)) 7+1 , 

L 7 + 1 7 + 1 > 

Hence, minimizing L~ / j{f q (x;9)} over possible values of 9 is equivalent to maximizing 

J /(x)|S|~5(^+t) [ e xp g {u(x;fl)}] 7 (2x. (9) 

For high dimensional data, however, it is unpractical to estimate the covariance matrix £ 
and its inverse. Note also that our main interest in this study is to find cluster centers. We 
thus employ £ = a 2 I p as our working model. By taking derivative of (9) with respect to /i, 
we get the stationarity for the maximizer /i* for any fixed a 2 : 

J xf(x)[exp q {u(x; fx* ,a 2 )}]^-^dx _ f xw{x; fi* ,a 2 )dF(x) 



/' 



(10) 



/ /(x)[expju(x;^*,f7 2 )}p-( 1 -9)(ix / w{x\ (J,*, a 2 )dF(x) ' 

where w(x; fi* , a 2 ) = [exp „{u(x; n* , a 2 )}] 7 ^ 1 " 9 ) is the weight function and F(x) is the 
cumulative distribution function corresponds to f(x). 

Suppose we have observed the data {xj}f =1 , the sample analogue of (i* can be obtained 
naturally by replacing F{x) in (10) with the empirical distribution function of {xj}f =1 . This 
gives the the stationarity for /x*: 

Y^ =1 Xiw(xi;fi*,a 2 ) 
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Notice that the weight function w depends on the choice of 7-divergence and (/-Gaussian 
used in the estimation criterion. When 7 = 1 — q, w(x; fi* , a 2 ) = 1 and 11* in (11) becomes 
the sample mean re -1 Y^=i x ii which is n °t robust. 
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One can see that the weight function w assigns the contribution of Xi to /i* . Thus, 
a robust estimator should encourage the property that smaller weight is given for those 
Xi farther away from li* and zero for outliers. These can be achieved by combining the 
minimization of the 7-divergence and the (/-Gaussian distribution. In particular, when 
q < 1, we have 



(12) 



0, |k-/,*|| 2 >g. 



w(x; \x* ,a ) 



3 7-SUP 

In this section, we introduce our clustering method, 7-SUP, which minimizes the 7-divergence 
on the mixture of (/-Gaussian distributions and employs the SUP algorithm to solve the nu- 
merical solutions. 

3.1 Motivation 

Suppose that / is a mixture of k components, i.e., 

k 

f{x) = Y,Khfh{x). (13) 
h=i 

The aim is to propose a clustering method which can tell apart these k components sampling 
from /. For the purpose of robustness, we model each as a (/-Gaussian distribution and 
use the minimum 7-divergence criterion to develop an estimation scheme for simultaneously 
estimating all k components, where k is automatically determined during the estimation 
procedure. 

Consider the trivial case that k = 1 first. The problem becomes to find the best q- 
Gaussian distribution to approximate the empirical distribution of all samples. From the 
discussions in the previous section, the solution that minimizes the 7-divergence satisfies 
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(10). When F is the empirical distribution, we have 



J xw(x; fjf ,<r 2 )dF(x) _ YjiXiw(xi] // ,a 2 ^ 



f w(x;fi*,a 2 )dF(x) ^ w(a;;;/x*,cr 2 ) 
A numerical way to obtain fi* is iteratively updating fj, using the above formula. 



(14) 



3.2 Proposed algorithm 

Suppose we have collected data {xj}" =1 . We would like to iteratively group them with 
group representatives , . . . , pfj, f }, where the number of clusters kg in the £ th iteration 
is data-driven and varies through the self-updating process. We start with n clusters (i.e., 
k$ = n) with initial representatives {/tf^ = Xi}™ =1 . Based on the stationary equation (10), 
for £ = 0, 1, 2, . . ., consider the following self-updating process: 

_ / xw{x;fxf\a 2 )dF^\x) 



jw{x-pJf\a 2 )dF^{x) 
The update (15) can be written as 

V n (0) .(0) 
(1) _ L,j=l w ij Pj 



, where F™(x) = ± Hfij < x). (15) 



/' 



yn (0) 

2^j=i w ij 
where the weights are given by 



/' 



(2) _ L,j=l w ij »j 



„(oo) 



(0 
i.i 



exp. 



1 
2^2 



7-(l-g) 



If we further define 



p = 7 — (1 — (?) > and s = s((7, 7) 

the weights in (17) can be re-expressed as 

2^1 V* 



1-9 



7 -(1-q) 



>0, 



1 



sp_ 
2a 2 



exp^ 



P 
2a 2 



In updating the local model representatives fip in the I th iteration, 7-SUP in (16) takes 
average over candidate model representatives fij , and the data points shrink toward their 



(16) 



(17) 



(18) 



(19) 
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final cluster centers {/ij°°' ) }" =1 , which have k (= koo) distinctive components denoted by 



{jj>h}h=i- The we ight wfj-, being non-negative and decreasing with respect to ||//to — //!• , 
in (17), assures the convergence of 7-SUP (Shiu and Chen, 2012). The principle of down 
weighting is important for robust model fitting (Basu et al., 1998; Field and Smith, 1994; 
Windham, 1995). However, we want to emphasize here that our down weighting by in 
(16) is with respect to models instead of to data. 2 That is, Fn\x) is also updated during 
the iteration of (15) for the case with respect to model, while it is always fixed at F^\x) 
for the case with respect to data. Such a weighting scheme with respect to model is more 
efficient than with respect to data. See Example 2 in Section 4 for numerical study. 

The 7-SUP algorithm can be further simplified. Define the scaling parameter r to be 



to to 1 



r = a 



_ {2 + (p + 2)s}p- 

Let %i = Xi/r and //to = fi( £ )/T, the weights in (19) can be re-expressed as 



(21) 



to 



1 



exp x _ 



sp_ 
2a 2 



/' 



to _ r Xi) 



1 



l/s 



a!" 



2 + (p + 2)s 



/' 



to _ r .(i) 



l/s 



(0 



(22) 



2 + (p + 2)s 

From (22), to implement 7-SUP, we need to determine the values of (s,t). It is found 
in our numerical studies in Section 4 that 7-SUP is quite insensitive to the choice of s. 
We thus suggest to choose a small value of s in practical implementation, which usually 
gives satisfactory results. In summary, 7-SUP starts with n clusters using each (scaled) 
individual data point //-°^ = Xj/r as a cluster member, which avoids the problem of random 
initials. Eventually, 7-SUP converges to certain k clusters, where k depends on the tuning 



2 For down weighting with respect to data, one has wfj = exp 1 _ s ^^f 5 "!! 2 '* — Aj^H^) an d the update 

v-^n (0) v-^ri (1) 



.(00) 



,(°) 



ij 



En 
J = l W 



That is, the weighted model average is replaced by weighted data average. 
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parameters (s, r), but otherwise is completely data-driven. At the end of the updating 
process, we have the cluster centers {ft^ = T-jlfi}i =1 and the cluster membership assignment 
{cj}™ =1 for each data point. The exact 7-SUP clustering algorithm is summarized in Table 1. 

Table 1: 7-SUP clustering algorithm 

Inputs: Data matrix X G §l nxp , n instances with p variables; 

Tuning parameters (s, r) 
Outputs: Number of clusters k and cluster centers {fih}h=v 

Cluster membership assignment {cj}" =1 for each of {xi}f =1 . 

begin 

Iter <- 

start with: \x\ <— Xi/r, % = 1, . . . , n 
repeat 

for % — 1 : n 

Wij <— exp x _ s (y— 2+ (l + 2) s ~ V'iW-i) ■ 
z i <~ YTj=\ *r™ l3 Wl N (update every point) 
end 

for i — 1 : n 
/li 4- Zi 

end 
Iter <- Iter + 1 
until convergence 

output distinct {r ■ fa, 1 < z < n} and cluster membership 

end 

Note: The parameter r is linear proportional to the hard influence region radius that defines the 
similarity inside a cluster. We observe a phase transition in cryo-EM image analysis that can suggest 
a reasonable region for r. 

3.3 Properties of 7-SUP 

The success of our proposed 7-SUP largely depends on the following properties. 
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• It adopts a (/-Gaussian mixture model, where the determination of number of compo- 
nents is data-driven. It starts with each individual data point as a singleton cluster. 
That is, it starts with a mixture of n components of g-Gaussians. 

• The q-Gaussian, with q < 1, sets a hard influence range for each component and 
completely rejects data outside this range. See the hard influence range reflected in 
the weights (12). 

• It estimates the model parameters by the minimum 7-divergence. The minimum 7- 
divergence performs a soft rejection by down weighting the influence of data deviant 
from the cluster centers, which further enhances the clustering robustness. 

• The self-updating process updates model weights, where the update process shrinks 
the fitted mixture model toward cluster centers in each iteration. Such a shrinkage 
update acts as if the effective temperature is iteratively decreasing, see Figure 2, 
so that it improves the efficiency of mixture estimation. The effective temperature 
is defined as cj/s, where cr| is an estimate for between-cluster variance in the £ th 
iteration. Refer to Examples 1 and 2 in Section 4 for more details and for efficiency 
comparison. 

• Note that 7-SUP aims to simultaneously extract all relevant clusters without the need 
of specifying the number of components and initials. It allows singleton or extremely 
small-sized clusters to accommodate potential outliers. 

3.4 On the case with q = 1 and 7 = 

The case, when 7 = and q = 1, corresponds to the minimum Kullback-Leibler divergence, 
or equivalently maximum likelihood, estimation of a mixture of usual Gaussian components. 
The minimum KL divergence over mixtures of k components of Gaussian distributions with 
EM algorithm leads to fc-means clustering (Banerjee et al., 2005), where k is predetermined. 
It is known that /c-means has some drawbacks, such as it needs to specify the number of 
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classes k, its clustering result depends on random initials, it is not robust to outliers, and it 
does not perform well when k is large. Note that the /c-means clustering is not the same as 
7-SUP with 7 = 0. The EM procedure on minimizing KL divergence updates with respect 
to data (see (20)), while 7-SUP updates with the model. 

We end this section with a remark below, which explains why the usual Gaussian mixture 
model (q = 1) together with MLE (7 = 0) is not robust, and that q < 1 and 7 > in our 
7-SUP method cover a reasonable range. 

Remark 1 (Fujisawa and Eguchi 2008; Eguchi et al. 2011). // q-Gaussian is used for 
modeling, we should adopt a 7 value with 7 > 1 — q, so that the minimum ^-divergence 
estimation can be robust against deviation from model assumption. 



In this section, we show by numerical examples that the performance of 7-SUP is quite 

stable with respect to the selection of tuning parameter s. We also show that 7-SUP, 

which adopts a down weighting scheme with respect to model, is more efficient in model 

parameter estimation than the usual robust model fitting, which adopts a down weighting 

scheme with respect to data. More detailed explanations for the difference between down 

weighting respect to data versus with respect to model are given in Examples 1 and 2. 

Example 1 (Stability with respect to s-selection and performance comparison). The data 
with sample size 100 is generated from a mixture of two normal distributions with density 
function (for simplicity, assume a 2 = 1) 



We assume that the true location parameter of interest is \i\ = in the first component 
of (23), but the observable data is contaminated by another normal distribution with mean 
/i2 = —7, where tt represents the proportion of contamination. Here, the estimator is for 
the mean parameter [i\ of the major component. 

The robustifying model fitting (RMF) is a robust estimation method proposed by Wind- 
ham (1995). When f(x;9) is the density function of Gaussain distribution, parameter es- 
timation by RMF is accomplished by weighted average, where the weights are updated 



4 Numerical study 




(23) 



16 



iteratively. Let {f(x; 8) : 8} be the class of model pdfs and 8^' be the parameter estimate 
of 8 in the £ th iteration. RMF first re-weights the data contribution by the model density 
through w*(x;§W) = {f(x;8^)}^ s , where s is a positive tuning parameter. The updated 
estimate is then obtained from an iteration similar to (15), but with the original data 

and with the weights w*(x, 8^), 

A main difference between 7-SUP and RMF is that, 7-SUP does weighted model average 
in its updates, while RMF does weighted data average in its updates. In view of this point, 
our aim here is to compare the performance of 7-SUP with that of RMF, in estimating 
the location parameter fj,\ of the major component. Both 7-SUP and RMF are used to fit 
the data using mixture of {f(x; /j,, a 2 ) : fi £ R}, where <r 2 is the sample variance estimate 
using entire data, and the center of the largest cluster is used as the estimate of the major 
component mean Simulation results with 100 replicates under 7r = 0.3 over different 
choices of s values are placed in Figure 1. It can be seen that the performance of 7-SUP is 
rather stable over various values of s, while that of RMF fluctuates more and is sensitive 
to the choice of s in the left boundary range s € [0.2, 0.6]. 




Figure 1: (Left) Means of different methods in estimating fi = under different tuning parameter s. 
(Right) Zoom-in for a better view. 

Simulation results at the optimal choice of s are further provided in Table 2, which 
gives the means and standard errors of the estimates from different methods. It can be 
seen that 7-SUP has a smaller standard errors than RMF, especially in the case of large 
contamination ir = 0.3. The superior performance of 7-SUP comes from the shrinkage 
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strategy built in the self-updating process. It acts as if the effective temperature parameter 
is continuously decreasing (see Figure 2), and as if the weight function is getting uniform, 
while the updating proceeds. In summary, 7-SUP is insensitive to the choice of the tuning 
parameter s, and is more robust against the influence of outliers than RMF. 



0.315 - 

0.31 
0.305 - 
0.3 - 
0.295 - 

0.29 - 
0.285 - 



1.5 2 2.5 

number of iterations 



Figure 2: 7-SUP acts as if the effective temperature parameter, is continuously decreasing, or as 
if the weight function is getting uniform, as the updating proceeds. (The effective temperature is 
defined as aj/s; where crj = ^jj Yli=i(pf^ ~ P^) 2 < — jjjj Ya=i PV * anc ' ^ ' s tne number of 
major clusters in £ th iteration.) 



Table 2: Means (standard errors) of different methods in estimating n — under different proportions 
of contamination ix. 



Method \ contamination 


7T = 0.1 


7T = 0.3 


7-SUP 
RMF 


0.29 x 10~ 4 (0.011) 
9.49 x 10- 4 (0.115) 


0.70 x 10" 4 (0.108) 
2.64 x 10" 4 (0.246) 



Example 2 (Efficiency comparison). The purpose of this example is to show that the 
proposed 7-SUP has a better efficiency in estimating the location parameter. Data {xj}™ =1 
are generated from a p-variate t distribution with 3 degrees of freedom and location and 
scale parameters fi = P and S = a 2 I, where P is a p-vector of zeroes. 7-SUP is run 
on a few s values in the range [0.15,0.8]. When converged, most of {/^°°^}" =1 merge 
together with possibly a few data points being left out to form extremely small-sized or 
even singleton clusters. We then use the center from the largest cluster as the estimator 
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for fj,. As we have mentioned, 7-SUP can be regarded as an iteratively reweighted model 
average (cf. iteratively reweighted data average). Its update is to weight on model rather 
than weight on data. Based on the same system of stationary equations, the usual robust 
estimator based on minimum 7-divergence (Mollah et al., 2010) takes the form of iteratively 
reweighted data average: 

Note the difference between F^p in (15) of 7-SUP and F^ in (24) of the 7-estimator. We 
compare the 7-SUP estimator (15), the usual robust 7-estimator (24), sample mean and 
the MLE for t distribution. It is known that sample mean will inherit large variation due to 
heavy tail probabilities of t distribution. On the other hand, MLE is shown to be optimal 
when the model is correctly specified. Thus, sample mean and MLE are used as references 
for the worst and the best scenarios. 

Simulation results with (n,p) 6 {(10, 100), (100, 100)} and 500 replicates are provided 
in Figure 3. Reported are the MSE curves (times n) for 4 methods, where MSE is defined 
by \\fi — /u || !/p averaged over R replicate runs and thus 

R lift „ll2 



n x MSE = n x 



J_ \ ^ 1 1 ~ HI2 

R^ p 



Not surprisingly, the best performer is MLE and the worst one is the sample mean, while 
7-SUP and 7-estimator have intermediate performances. 7-SUP performs very closely to 
the optimal estimator MLE for every setting of p and choice of s. It indicates the superiority 
of 7-SUP which provides a robust estimation for location parameter, even when the model 
is not correctly specified. Although both 7-SUP and 7-estimator adopt the same minimum 
7-divergence criterion to estimate parameters, our simulation results show that 7-SUP that 
weights on model does uniformly perform better than 7-estimator that weights on data. 
This observation reflects the potential of "weight on model" in alleviating poor influence 
due to outliers. 



5 Application to cryo-EM images 

To evaluate the performance of 7-SUP on cryo-EM images, we use the image data created 
from a model molecule so that we can compare our result with the known solution. A total of 
128 distinct 2-D images with 100 x 100 pixels are generated by projecting the X-ray crystal 
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p = 10, n = 100; based on 500 replicate runs 



p = 100, n = 100; based on 500 replicate runs 



y-sup estimator 
y-estimator 
sample-mean-t 
mle-t 





- y-sup estimator 

- y-estimator 

- sample-mean-t 

- mle-t 



0.5 1 1 1 1 1 1 1 0.5 L 

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.2 0.3 0.4 0.5 0.6 0.7 0.8 

Figure 3: Efficiency comparison. The 7-SUP algorithm, which is based on iteratively reweighted 
model average is more efficient than the 7-estimator that is iteratively reweighted data average. MLE 
(light blue) and sample mean (red) are used as two references. 

structure of RNA polymerase II filtered to 20 Angstrom in equally spaced (angle-wise) 
orientations. 3 Each image is then convoluted with electron microscopy contrast transfer 
function (defocus 2 /xm). Finally, 6400 images are randomly sampled with replacement 
from these 128 projections with iid Gaussian noise N(0, £ 2 ) added, where £ 2 is chosen to 
reflect the signal-to- noise ratio (SNR) covering from .08 to .2 on average. This procedure to 
simulate the cryo-EM images is commonly used in the cryo-EM community (Chang et al., 
2010; Singer et al., 2010; Sorzano et al., 2010; Hall et al., 2011). 

Before comparing the performance of various clustering algorithm, we conduct dimension 
reduction on this image data set. Instead of using principal component analysis (PCA), we 
apply multilinear principal component analysis (MPCA, Lu et al. 2008; Hung et al. 2012) 
to reduce the dimension from 10000 to 100, as MPCA has been shown to be more efficient 
in dimension reduction for image sets containing core contents (Hung et al., 2012). The 
extracted images from MPCA then enter our clustering analysis. 

We implement fc-means, clustering 2D (CL2D) (Sorzano et al., 2010), and their variant 
A:-means + for comparison. Given the number of clusters, unlike A:-means which separates 



3 Data source: The X-ray model of RNA polymerase II is from Protein Data Bank (PDB: 1WCM). 
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the data into K clusters in each single iteration, CL2D bisects the clusters iteratively until 
the pre-specified K clusters are constructed. Another difference between them is that CL2D 
adopts the correntropy (a Gaussian kernel) as the measure of distance, while /c-means uses 
the usual Euclidean norm. To avoid outputting extremely uneven sized clusters usually 
produced in /c-means for dealing with large number of clusters, CL2D and /c-means + define 
a number N m i n that they dismiss the clusters whose size is below N m i n and split the largest 
cluster to two once a dismiss is executed. Here, we let N m i n = 30. fc-means + differs from 
CL2D in that it preserves the Euclidean norm. We observe that all the mistakes that 7- 
SUP makes is to merge two clusters or more in one. We thus propose a 7-SUP + to improve 
7-SUP by further using fc-means to separate those clusters whose size is greater than 70 in 
this case study. In real application, the threshold for the unevenly large cluster size should 
be adjusted according to the ratio of the total number of images and the number of clusters 
expected. 

We follow the idea of the purity index in Manning et al. (2008) to evaluate the clustering 
results. For each output cluster, the purity number is the number of images of the majority 
class in this cluster. The overall purity number is the sum over all clusters. Formally, the 
purity number is defined as 



where {q} are sets of true classes, {ujj} are sets of clusters, and | • | is the cardinality of the 
set. The impurity number is the difference between the total number of images N and the 
purity number, 



Note that the purity is usually defined to be the ratio of the purity number and the total 
number of the images. Here we do not normalize it by the total number for better presen- 
tation of the simulation results. The impurity number is for the perfect clustering result, 
but the zero impurity number does not guarantee a perfect clustering. This number can 



purity 




impurity = N — purity = N 




3 
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not pick up the mistake made by splitting one class into two or more clusters. Here we 
propose a complementary number, termed as c-impurity, which does the same thing but 
exchanging the roles of the true classes and the output clusters in defining the impurity 
number. That is, 



The c-impurity value is able to pick up the mistakes by splitting a cluster to two or more 
ones. In Tables 3, 4 and 5, we present both the impurity and c-impurity numbers for each 
clustering result. 

The three methods, CL2D and A;-means and A;- means" 1 ", all require the pre-specification 
of K and the initial assignments for each image, while 7-SUP and 7-SUP + do not. CL2D 
further needs a tuning parameter for the kernel width. We assign the best parameters for 
them and present the best out of 10 runs. In contrast to this, the output of 7-SUP is 
determined as long as its parameters s and r are fixed. We show a strategy to choose the 
parameter r through a phase transition diagram in Subsection 5.1. 

To reflect the nature of the experiment, we test two circumstances. In Case-1, the 
replicates of the same orientation with iid noise are all perfectly aligned. In Case-2, we 
design two sets that 10% and 20% of the images are misaligned, where these misaligned 
images can be treated as outliers and should be identified individually without clustering 
with other correctly aligned images. We conduct the analysis under different noise levels 
£ = 40,50,60, so that the corresponding SNRs are 0.19,0.12,0.08 on average. This reflects 
the low signal and high noise nature of cryo-EM images. Since the clustering result depends 
on the selection of tuning parameters, we report the best result only. 

5.1 Case-1: clustering with perfectly-aligned images 

Table 3 presents the performance comparison on 7-SUP, 7-SUP" 1 ", CL2D, fc-means" 1 " and k- 
means for the simulated images with perfect alignment. It can be seen that the performance 



c-impurity = N 
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of &-means is poor for all £ values, even we correctly specify K = 128. This reflects the 
drawback of fc-means when the noise level is high and when the number of classes is large. 
For small noise level £ = 40, all the other four methods give perfect clustering. Their 
clustering accuracies slowly decay as £ increases. The errors that 7-SUP makes are always 
to wrongly combine two clusters as a single one, which can be easily fixed by 7-SUP + . 
CL2D mixes up (4, 4) images when £=60. It seems not expected for fc-means + to have 
(33, 34) images mixed up when £=50 but make it perfect when £=60. This is due to the 
10 random initials mentioned above. 

For 7-SUP, the parameters (s, r) need to be determined. We use the data example with 
£ = 40 to demonstrate the effect of these tuning parameters and provide guidance to select 
them. In fact, we observe that the performance of 7-SUP is quite stable for a small value of 
s, and it is r that really matters as to the clustering result. 7-SUP gives similar results for 
0.01 < s < 0.03 and, hence, we fix s = 0.025 for the rest of analysis. The scale parameter r 
is proportional to the support region of the g-Gaussian distribution, which allows the users 
to tune the similarity level inside a cluster. When r is small enough, 7-SUP will always 
output 6400 clusters (each individual cryo-EM image forms one cluster) and when r is too 
large, we will have a single cluster (all the images belong to the same cluster). We report 
the numbers of clusters from 7-SUP under various values of r in Figure 4. 

We further observe a phase transition of the number of clusters verses r which would be 
very helpful to choose r. When r is below 83, 7-SUP outputs 6400 clusters, which means 
that each cluster contains one image. When r reaches 83, the cluster number becomes 128, 
a perfect result. There exists no intermediate result between these two for this data set as 
shown in Figure 4. Moreover, the cluster number remains at 128 for quite a wide range of 
r 6 [83, 105]. Recalling that 7-SUP updating ignores the influence of data outside a certain 
range determined by r, we construct this data set such that each cluster has similar within- 
cluster distance. When the corresponding scale parameter r is small enough that there is no 
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influence between any two images, then 7-SUP leads to 6400 clusters with each individual 
as one cluster. On the other hand, when the scale parameter reaches a critical value, the 
images in the same cluster can start attracting each other and will finally merge. This 
explains why a phase transition occurs. We observe similar phase transition phenomena for 
various noise structures, of which some may not happen at the perfect cluster result, but 
not far from it. Thus, the value at which the phase transition occurs can be treated as a 
starting value for selecting a reasonable range of r. 

5.2 Case-2: clustering with misaligned images 

It is highly possible that cryo-EM images can not be well-aligned due to their low SNR. A 
good cluster method should be robust in the presence of misaligned images (outliers). Here, 
two experiments are designed to test the performance of these clustering algorithms when 
10% and 20% misaligned images exist. From each of the 128 clusters, the pre-specified 
percentage of the images are randomly chosen to be rotated in the order of 7.5, 15, 22.5, 
30, 37.5 and 45 angular degrees (°) clockwise. Each rotated image does not share the 
same signal pattern with the images in its original cluster and with the other misaligned 
ones either. An ideal situation is to treat each of these misaligned images as a singleton 
cluster. Including these singleton clusters, the total cluster number becomes 771 for 10% 
misalignment and 1410 for 20%, while the meaningful cluster number remains 128. The 
results are presented in Table 4 for 10% misalignment and Table 5 for 20% misalignment. 

7-SUP again makes the mistakes of merging two clusters in most cases and these mistakes 
can be easily fixed by 7-SUP + . As shown in Table 4, for the 10% misalignment case, 7- 
SUP has 7 misaligned images merged to correct clusters after "re-clustering" by 7-SUP + 
even when £ = 60. On the other hand, fc-means + and CL2D can do nothing about the 
misaligned images which are 643 in the 10% misalignment case, such that their default 
mistakes are 643 in the c-impurity category. Similarly, there exist 1282 default mistakes 
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composed by the misaligned images for the 20% misalignment case. 

We observe that the performance of CL2D and /c-means + are more seriously impacted 
when the default mistake ratio reach 20%. While CL2D makes no other mistake at £ = 40 
and (1, 1) mistakes at £ = 50 and £ = 60 in addition to the default 643 misalignment 
images in the 10% case, it makes 421-444 more image merges in addition to the default 
1282 ones for the 20% case. /c-means + shares similar pattern when compared on the 10% 
and 20% cases. This conveys a message that as a large number of outliers are enforced to 
enter the clusters, their cluster representatives can be contaminated to some point that it 
is not easy for them to find their correct cluster members. 

6 Conclusion 

We combine a model-based clustering method 7-estimator and a distance-based clustering 
method SUP to propose 7-SUP to meet the image clustering challenges in cryo-EM analysis. 
The cryo-EM images have the characteristics of having low signal-to-noise ratio, containing 
many misaligned images as outliers, and consisting of a large number of clusters due to 
free orientations. Because of its capability to label the outliers, 7-SUP could pick up 
the misalignment images and create the possibility for further correcting the misaligned 
images. As the noise intensity increase, 7-SUP may make mistakes by merging two or 
more clusters as one. However, this can be easily fixed by fc-means. Thus, we are able to 
present a successful application for 7-SUP on the simulated cryo-EM images. 7-SUP has 
some crucial advantages summarized as follows. 

• 7-SUP does not require any initials, and the number of clusters k is data-driven. 

• In 7-SUP, the parameters (7, q, a) involved in 7-divergence and g-Gaussian mixture 
are transformed to the scale parameter r, see (21), and the power parameter s, see (18) 
and (19). We show that 7-SUP has robust performance over s in many scenarios. Once 
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s is chosen, the phase transition scheme may suggest a reasonable range for r. This 
transformation to (r, s) and the observation of the phase transition greatly reduce the 
difficulty in selecting the tuning parameters. 

• 7-SUP adopts an iterative shrinkage estimation. In each iteration, the updating pro- 
cess shrinks the mixture model estimation toward cluster centers. It acts as if the 
effective temperature parameter in 7-SUP is continuously decreasing and leads to a 
more efficient estimation scheme. 

These advantages would benefit the cryo-EM community in their image analysis. 
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Table 3: Clustering result with perfect alignment images 
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Table 4: Clustering result with 10% mis-alignment images 
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Table 5: Clustering result with 20% mis-alignment images 
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Figure 4: The numbers of clusters by 7-SUP under various values of r. A phase transition occurs 
when the scale parameter r is 83. 
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Figure 5: The three rows are three projections with different orientations which A;-means merge them 
as one cluster and 7-SUP separates them perfectly when SNR=.1048. The columns show different 
intensities of noise added. 
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